Abstract

Unbalanced data arises in many learning tasks such as clustering of multi-class data, hierarchical divisive clustering and semi-supervised learning. Graph-based approaches are popular tools for these problems. Graph construction is an important aspect of graph-based learning. We show that graph-based algorithms can fail for unbalanced data for many popular graphs such as k-NN, ε-neighborhood and full-RBF graphs. We propose a novel graph construction technique that encodes global statistical information into node degrees through a ranking scheme. The rank of a data sample is an estimate of its p-value and is proportional to the total number of data samples with smaller density. This ranking scheme serves as a surrogate for density, can be reliably estimated, and indicates whether a data sample is close to valleys/modes. This rank-modulated degree (RMD) scheme is able to significantly sparsify the graph near valleys and provides an adaptive way to cope with unbalanced data. We then theoretically justify our method through limit cut analysis. Unsupervised and semi-supervised experiments on synthetic and real data sets demonstrate the superiority of our method.



1 Introduction and Motivation 

Graph-based approaches are popular tools for unsupervised clustering and semi-supervised (transductive) learning (SSL) tasks. In these approaches, a graph representing the data set is constructed. Then a graph-based learning algorithm such as spectral clustering (SC) ([5]) or an SSL algorithm is applied on the graph. These algorithms solve the graph-cut minimization problem on the graph. Of the two steps, graph construction is believed to be critical to the performance and has been studied extensively ([5], [7], [3], [10]). In this paper we focus on graph-based learning for unbalanced data. The issue of unbalanced data has independent merit and arises routinely in many applications including multi-mode clustering, divisive hierarchical clustering and SSL. Note that while model-based approaches ([11]) incorporate unbalancedness, they typically work well only for relatively simple cluster shapes. In contrast, non-parametric graph-based approaches are able to capture complex shapes ([2]).

Common graph construction methods include the ε-graph, the fully-connected RBF-weighted (full-RBF) graph and the k-nearest neighbor (k-NN) graph. The ε-graph links two nodes $u$ and $v$ if $d(u, v) < \epsilon$; it is vulnerable to outliers due to the fixed threshold ε. The full-RBF graph links every pair with RBF weights $w(u,v) = \exp(-d(u,v)^2 / 2\sigma^2)$, which is in fact a soft threshold compared to the ε-graph ($\sigma$ plays a role similar to ε). Therefore it also suffers from outliers. The k-NN graph links $u$ and $v$ if $v$ is among the $k$ closest neighbors of $u$ or vice versa. It is robust to outliers and is the most widely used method ([7], [8]). In [9] the authors propose the b-matching graph. This method is supposed to eliminate some of the spurious edges that appear in the k-NN graph and lead to improved performance ([10]).

Nevertheless, it remains unclear what exactly spurious edges are from a statistical viewpoint, and how they impact graph-based algorithms. While it is difficult to provide a general answer, the underlying issues can be clarified by considering unbalanced and proximal clusters, as we observe in many examples.

Example: Consider a data set drawn i.i.d. from a proximal and unbalanced 2D Gaussian mixture density, $\sum_{i=1}^{2} \alpha_i N(\mu_i, \Sigma_i)$, where $\alpha_1 = 0.85$, $\alpha_2 = 0.15$, $\mu_1 = [4.5; 0]$, $\mu_2 = [-0.5; 0]$, $\Sigma_1 = \mathrm{diag}(2, 1)$, $\Sigma_2 = I$. Fig. 1 shows different graphs and clustering results. SC fails on the ε-graph and the full-RBF graph due to outliers; SC on the k-NN (b-matching) graph cuts at the wrong position.

Figure 1: Graphs and SC results of unbalanced 2-Gaussian density. Graph cut curves in (b) are properly rescaled. Full-RBF graph is not shown. Binary weights are adopted on the ε-graph, k-NN, b-matching and RMD graphs. The clustering results on the full-RBF graph and the ε-graph are exactly the same. The valley cut is approximately $x_1 \approx 1$. SC fails on the ε-graph and the full-RBF graph due to outliers. SC on the k-NN or b-matching graph cuts at the balanced position due to the impact of the balancing term. Our method significantly prunes spurious edges near the valley, enabling SC to cut at the valley, while maintaining robustness to outliers.

Discussion: To explain this phenomenon, we inspect SC from the viewpoint of graph-cut. $G = (V, E)$ is the graph constructed from a data set with $n$ samples. Denote $(C, \bar{C})$ as a 2-partition of $V$ separated by a given hyperplane $S$. The simple cut is defined as:

$$\mathrm{Cut}(C, \bar{C}) = \sum_{u \in C,\, v \in \bar{C},\, (u,v) \in E} w(u, v) \qquad (1)$$

where $w(u,v)$ is the weight of edge $(u,v) \in E$. To avoid the problem of singleton clusters, SC attempts to minimize the RatioCut or Normalized Cut (NCut):

$$\mathrm{RatioCut}(C, \bar{C}) = \mathrm{Cut}(C, \bar{C}) \left( \frac{1}{|C|} + \frac{1}{|\bar{C}|} \right) \qquad (2)$$

$$\mathrm{NCut}(C, \bar{C}) = \mathrm{Cut}(C, \bar{C}) \left( \frac{1}{\mathrm{vol}(C)} + \frac{1}{\mathrm{vol}(\bar{C})} \right) \qquad (3)$$

where $|C|$ denotes the number of nodes in $C$ and $\mathrm{vol}(C) = \sum_{u \in C,\, v \in V} w(u, v)$. Note that RatioCut (NCut) augments the Cut (we will keep using Cut throughout the paper to denote the simple cut) with a balancing term, which in turn induces partitions to be balanced. While this approach works well in many cases, minimizing RatioCut (NCut) on the k-NN graph has severe drawbacks when dealing with intrinsically unbalanced and proximal data sets.

As a whole, the existing graph construction methods 
on which the graph-based learning is based have the 
following weaknesses: 

(1) Vulnerability to Outliers: Fig. 1(b) shows that the RatioCut curves on the ε-graph (cyan) and the full-RBF graph (green), as well as the Cut curve (black), are all small at the boundaries. This means that in many cases seeking the global minimum on these curves will incur single-point clusters. It is relatively easy to see that the Cut value is small at the boundaries since there is no balancing term. However, it is surprising that even for this simple example, the ε-graph and the full-RBF graph are vulnerable to outliers (Fig. 1(c)), even with the balancing term. This is because the Cut value is zero for the ε-graph and exponentially small for the full-RBF graph. The k-NN graph does not suffer from this problem and is robust to outliers.

(2) Balanced Clusters: The k-NN graph, while robust, results in poor cuts when the clusters are unbalanced. Note that the valley of the k-NN Cut curve (black), which corresponds to the density valley ($x_1 = 1$ in Fig. 1(b)), is the optimal cut in this situation. However, the red curve in Fig. 1(b) shows that the minimum RatioCut on the k-NN graph (similarly on the b-matching graph) is attained at a balanced position ($x_1 = 3.5$).

Example Continued: To understand this drawback of the k-NN graph, consider a numerical analysis of this example. Assume the sample size $n$ is sufficiently large and $k$ is properly chosen such that the k-NN linkages approximate the neighborhoods of each node. Then there are roughly $k n_S$ edges near the separating hyperplane $S$, where $n_S$ is the number of points along $S$. Empirically $\mathrm{RatioCut}(S) \approx k n_S \left( \frac{1}{|C|} + \frac{1}{|\bar{C}|} \right)$ on the unweighted k-NN graph. Let $S_1$ be the valley cut ($|C|/|V| = 0.15$), which corresponds to $x_1 \approx 1$; let $S_2$ be the balanced cut ($|C|/|V| = 0.5$), $x_1 \approx 3.5$. Let $n_1$ and $n_2$ be the number of points within a $k$-neighborhood of $S_1$ and $S_2$ respectively. It follows that $\mathrm{RatioCut}_1 \approx 7.84\, k n_1 / n$ and $\mathrm{RatioCut}_2 \approx 4\, k n_2 / n$. SC will erroneously prefer $S_2$ rather than the valley cut $S_1$ if $n_2 < 1.96\, n_1$. Unfortunately, this happens very often in this example since the data set is unbalanced and proximal.
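The constants in this example follow directly from the balancing term. A minimal sketch in Python, assuming only the empirical approximation RatioCut(S) ≈ k n_S (1/|C| + 1/|C̄|) stated above; the helper name `balancing_term` is ours and is introduced purely for illustration:

```python
def balancing_term(fraction):
    """Return n * (1/|C| + 1/|C_bar|) when |C| = fraction * n."""
    return 1.0 / fraction + 1.0 / (1.0 - fraction)

valley = balancing_term(0.15)    # valley cut S_1, |C|/|V| = 0.15
balanced = balancing_term(0.5)   # balanced cut S_2, |C|/|V| = 0.5

# RatioCut_2 < RatioCut_1  <=>  balanced * n_2 < valley * n_1
threshold = valley / balanced
print(round(valley, 2), round(balanced, 2), round(threshold, 2))
```

The three printed numbers are the coefficients of the two RatioCut values and the ratio n_2/n_1 below which the balanced cut wins.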

We ascribe this observation to two reasons: 

(a) RatioCut (NCut) Objective: RatioCut (NCut) does not necessarily achieve its minimum at density valleys. The minimum is attained as a tradeoff between the Cut and the sizes of the clusters (balancing term). While this makes sense when the desired partitions are approximately balanced, it performs poorly when the desired partitions are unbalanced, as shown in Fig. 1(d),(e).

(b) Graph Construction: The k-NN (b-matching) graph does not fully account for the underlying density. Primarily, density is only incorporated in the number of data points along the cut. The degree of each node is essentially a constant and does not differentiate between high/low density regions. This can be too weak for proximal data. Indeed, [12] also argues that the k-NN graph only encodes local information and the clustering task is achieved by pulling all the local information together. In addition, we also observe this phenomenon from the limit cut on the k-NN graph [8], which, up to normalization, reads:

$$\lim_{n \to \infty} \mathrm{NCut}_n(S) \;\propto\; \int_S f^{1-\frac{1}{d}}(s)\,ds \left( \frac{1}{\mu(C)} + \frac{1}{\mu(\bar{C})} \right) \qquad (4)$$

where $\mu(C) = \int_C f(x)\,dx$, $f$ is the underlying density and $d$ the dimension. The limit of the simple Cut, $\int_S f^{1-\frac{1}{d}}(s)\,ds$ (even without the weakening power of $d$), only counts the number of points along $S$. There is no information about whether $S$ is near valleys/modes. While other graphs do incorporate more density (the limits of Cut are $\int_S f^2(s)\,ds$ for the ε-graph [5] and $\int_S f(s)\,ds$ for the full-RBF graph [13]), they are vulnerable to outliers and their performance is unsatisfactory for both balanced and unbalanced data, as in [10] and our simulations.

Our Approach: The question now is how to find the valley cut while maintaining robustness to outliers. At a high level we would like to follow the density curve while avoiding clusters that correspond to outliers. One option is to improve the objective, for example, to minimize $\mathrm{Cut}^2(S) \left( \frac{1}{|C|} + \frac{1}{|\bar{C}|} \right)$ on the k-NN graph. However, it is unclear how to solve this problem. We adopt another approach, which is to construct a graph that incorporates the underlying density and is robust to outliers. Specifically, we attempt to adaptively sparsify the graph by modulating node degrees through a ranking scheme. The rank of a data sample is an estimate of its p-value and is proportional to the total number of samples with smaller density. This rank indicates whether a node is close to density valleys/modes, and can be reliably estimated. Our rank-modulated degree (RMD) graph is able to significantly reduce (increase) the Cut values near valleys (modes), thereby emphasizing the Cut term in RatioCut optimization, while maintaining robustness to outliers. Moreover, our scheme provides graph-based algorithms with adaptability to unbalanced data. Note that success in clustering unbalanced data has implications for divisive hierarchical clustering with multiple clusters: in this scheme, at each step a binary partition is obtained by SC, possibly with two unbalanced parts.

The remainder of the paper is organized as follows. We describe our RMD scheme for learning with unbalanced data in Sec. 2, with more details in Sec. 2.1-2.4. The limit expression of RatioCut (NCut) for the RMD graph is investigated in Sec. 3. Experiments on synthetic and real data sets are reported in Sec. 4.

2 RMD Graph: Main Idea 

We propose to sparsify the graph by modulating the 
degrees based on ranking of data samples. The rank- 
ing is global and we call the resulting graph the 
Rank-Modulated Degree(RMD) graph. Our graph is 
able to find the valley cut while being robust to out- 
liers. Moreover, our method is adaptable to data sets 
with different levels of unbalancedness. The process 
of RMD graph based learning involves the following 
steps: 

(1) Rank Computation: The rank $R(u)$ of node (data sample) $u$ is calculated:

$$R(u) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{G(u) \le G(x_i)\}} \qquad (5)$$

where $\mathbb{I}$ denotes the indicator function and $G(u)$ is some statistic of $u$. The rank is an ordering of the data samples based on the statistic $G(\cdot)$.
(2) RMD Graph Construction: Connect each node $u$ to its $\deg(u)$ closest neighbors, where the degree is modulated through an affine monotonic function of the rank:

$$\deg(u) = k(\lambda + \phi(R(u))) \qquad (6)$$

where $k$ is the average degree, $\phi$ is some monotonic function and $\lambda \in [0, 1]$.

(3) Graph-based Learning Algorithms: Spectral clustering or graph-based SSL algorithms, which involve minimizing RatioCut (NCut), are applied on the RMD graph.

(4) Cross-Validation Scheme: Repeat (2) and (3) using different configurations of the degree modulation scheme (Eq. (6)). Among the results, pick the one with the minimum simple Cut value.
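Steps (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes NumPy, takes the statistic values G(x_i) as given, and rounds the modulated degree to an integer of at least 1 (the rounding rule and the function names are our assumptions).

```python
import numpy as np

def rank_from_statistic(g):
    """Eq. (5): R(u) = fraction of samples x_i with G(u) <= G(x_i)."""
    g = np.asarray(g, dtype=float)
    # compare every G(u) (rows) with every G(x_i) (columns)
    return (g[:, None] <= g[None, :]).mean(axis=1)

def modulated_degree(rank, k, lam=0.5, phi=lambda r: r):
    """Eq. (6): deg(u) = k * (lambda + phi(R(u))), rounded to an integer >= 1.
    The rounding is an assumption made for this sketch."""
    deg = k * (lam + phi(np.asarray(rank, dtype=float)))
    return np.maximum(1, np.rint(deg).astype(int))

# Hypothetical statistic values: a large G means low density
# (for instance a nearest-neighbor distance).
g = [0.2, 0.5, 0.1, 0.9]
r = rank_from_statistic(g)
print(r)                          # ranks lie in (0, 1]
print(modulated_degree(r, k=10))  # low-rank nodes get fewer neighbors
```

With the identity for `phi` and `lam=0.5`, the sample with the largest statistic (lowest density) receives the smallest rank and hence the smallest degree.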

2.1 Rank Computation 

We have several options for choosing the statistic $G(\cdot)$ for rank computation.

(1) ε-Neighborhood: Choose an ε-ball around $u$ and set $G(u) = -N_\epsilon(u)$, where $N_\epsilon(u)$ is the number of neighbors within an ε radius of $u$.

(2) l-Nearest Neighborhood: Here we choose the $l$-th nearest-neighbor distance of $u$: $G(u) = D_{(l)}(u)$.

(3) Average l-Neighbor Distance: The average of $u$'s nearest-neighbor distances over a range of neighbor orders $i$:

$$G(u) = \frac{1}{l} \sum_{i} D_{(i)}(u) \qquad (7)$$

Empirically we observe that the third option leads to better performance and robustness. For the purpose of implementation (see Sec. 4) we use the U-statistic resampling technique for reducing variance ([14]).

Properties of Rank Function: 

(1) Valley/Mode Surrogate: The value of $R(u)$ is a direct indicator of whether $u$ lies near density valleys (see Fig. 2). It ranges in $[0, 1]$, with small values indicating globally low-density regions and large values indicating globally high-density regions.

(2) Smoothness: $R(u) \in [0, 1]$ is asymptotically the integral of the pdf (see Thm. 1 in Sec. 3); it is smooth and uniformly distributed. This makes it appropriate to globally modulate the degrees of nodes without worrying about scale, and allows controlling the average degree.

(3) Precision: We do not need our estimates to be precise for every point; the resulting cuts typically depend on the low ranks, rather than the exact values, of most nearby points.




Figure 2 (pdf vs rank): pdf and ranks of the unbalanced 2-Gaussian density of Fig. 1. High/low ranks correspond to density modes/valleys.



2.2 Rank-Modulated Degree Graph 

Based on the ranks, our degree modulation scheme (Eq. (6)) addresses the two issues of traditional graphs, and additionally provides adaptability to data sets with different levels of unbalancedness:

(1) Emphasis of Cut term: The monotonicity of $\deg(u)$ in $R(u)$ immediately implies that nodes near valleys (with small ranks) will have fewer edges. Fig. 1(b) shows that, compared to the k-NN graph, the RMD graph significantly reduces (increases) Cut (thus RatioCut) values near valleys (modes), and the RatioCut minimum is attained near valleys. Fig. 1(f) demonstrates the remarkable effect of graph sparsification near density valleys on the RMD graph.

Example Continued: Recall the example of Fig. 1. On the RMD graph $\mathrm{RatioCut}_1 \approx 7.84\, k_1 n_1 / n$ and $\mathrm{RatioCut}_2 \approx 4\, k_2 n_2 / n$, where $k_1$ and $k_2$ are the average degrees of nodes around the valley cut $S_1$ ($x_1 = 1$) and the balanced cut $S_2$ ($x_1 = 3.5$). Fig. 2 shows the average rank is roughly 0.2 near $S_1$ and 0.7 near $S_2$. Suppose the degree modulation scheme is $\deg(u) = k(1/3 + 2R^2(u))$; then $k_1 \approx 0.41k$ and $k_2 \approx 1.31k$. Now $S_2$ is erroneously preferred only if $n_2 < 0.61\, n_1$, which hardly happens!

(2) Robustness to Outliers: The minimum degree of nodes in the RMD graph is $k\lambda$, even for distant outliers. This leads to robustness, as is shown in Fig. 1(b), where the RatioCut curve on the RMD graph (blue) goes up at border regions, guaranteeing that the valley minimum is the global minimum of RatioCut. Fig. 1(f) also shows that SC on the RMD graph works well with the outlier.

(3) Adaptability to Unbalancedness: The parameters in Eq. (6) can be configured flexibly. With



Jing Qian, Venkatesh Saligrama, Manqi Zhao 



our cross-validation scheme, this degree modulation scheme can offer graph-based learning algorithms adaptability to data sets with different levels of unbalancedness. We discuss this advantage in Sec. 2.4.
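The thresholds quoted in the example under item (1) can be checked by direct substitution. The following worked computation assumes only the approximations stated there (average ranks of roughly 0.2 near $S_1$ and 0.7 near $S_2$, and the scheme $\deg(u) = k(1/3 + 2R^2(u))$):

```latex
\begin{align*}
k_1 &\approx k\left(\tfrac{1}{3} + 2(0.2)^2\right) = k\,(0.333 + 0.08) \approx 0.41\,k, \\
k_2 &\approx k\left(\tfrac{1}{3} + 2(0.7)^2\right) = k\,(0.333 + 0.98) \approx 1.31\,k, \\
\mathrm{RatioCut}_2 < \mathrm{RatioCut}_1
  &\iff 4\,k_2\,n_2 < 7.84\,k_1\,n_1
  \iff n_2 < \frac{7.84 \times 0.41}{4 \times 1.31}\, n_1 \approx 0.61\, n_1 .
\end{align*}
```

Compared with the k-NN graph, where the balanced cut wins whenever $n_2 < 1.96\, n_1$, the rank modulation shrinks the region in which the balanced cut is preferred by roughly a factor of three.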

To construct the RMD graph, we provide two algorithms. The first is similar to the traditional k-NN construction: connect $u$ and $v$ if $v$ is among the $\deg(u)$ nearest neighbors of $u$ or $u$ is among the $\deg(v)$ nearest neighbors of $v$. We determine the degree of each node based on Eq. (6). The time complexity is $O(n^2 \log n)$.
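The k-NN style construction can be sketched as follows. This is a minimal illustration under our own assumptions (NumPy, Euclidean distance, binary weights, a dense distance matrix); the function name `build_rmd_graph` is hypothetical, and `deg` is assumed to hold the integer degrees obtained from Eq. (6).

```python
import numpy as np

def build_rmd_graph(X, deg):
    """Symmetric 0/1 adjacency: link u and v if v is among the deg[u] nearest
    neighbors of u, or u is among the deg[v] nearest neighbors of v."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # pairwise Euclidean distances (other metrics could be substituted)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # exclude self-loops
    order = np.argsort(dist, axis=1)          # neighbors sorted by distance
    A = np.zeros((n, n), dtype=int)
    for u in range(n):
        d = min(int(deg[u]), n - 1)           # cannot exceed n - 1 neighbors
        A[u, order[u, :d]] = 1                # directed edges from u
    return np.maximum(A, A.T)                 # "or" rule gives an undirected graph

# Four points on a line, each asking for a single neighbor.
A = build_rmd_graph([[0.0], [1.0], [2.5], [10.0]], [1, 1, 1, 1])
print(A)
```

Because of the "or" rule, a node can end up with more than `deg[u]` edges when other nodes select it as one of their neighbors, exactly as in the ordinary symmetric k-NN graph.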

The second method solves an optimization problem to construct the RMD graph:

$$\min \sum_{i,j} P_{ij} D_{ij} \qquad (8)$$

$$\text{s.t.} \quad \sum_{j} P_{ij} = \deg(x_i), \quad P_{ii} = 0, \quad P_{ij} = P_{ji}, \quad \forall i, j \in 1, \dots, n$$

where $D_{ij}$ is the distance between $x_i$ and $x_j$, and $P_{ij} = 1$ indicates an edge between $x_i$ and $x_j$. Different distance metrics can be applied, such as the Euclidean distance or other derived distance measures.

To solve this optimization problem, we apply the Loopy Belief Propagation algorithm originally designed for the b-matching problem in [15]. The authors prove that max-product message passing is guaranteed to converge to the true MAP solution in $O(n^3)$ time on bipartite graphs; in practice the convergence can be much faster. More details can be found in [15].

2.3 Graph Based Learning Algorithms 

We apply several popular graph-based algorithms on various graphs to validate the superiority of the RMD graph for unsupervised clustering and semi-supervised learning tasks. These algorithms all involve minimizing $\mathrm{tr}(F^T L F)$ plus some other constraints or penalties, where $F$ is the cluster indicator function or classification (labeling) function and $L$ is the graph Laplacian matrix. Based on spectral graph theory [16], this is equivalent to minimizing RatioCut (NCut) for unnormalized (normalized) $L$.

Spectral Clustering (SC): Let $n$ denote the sample size, $c$ the number of clusters, $C_j$ the $j$-th cluster, $j = 1, \dots, c$, and $L$ the unnormalized graph Laplacian matrix computed from the constructed graph. $F$ is the cluster indicator matrix. Unnormalized SC (NCut minimization for normalized $L$ is similar) aims to solve the following optimization problem:

$$\min_{F} \ \mathrm{tr}(F^T L F) \quad \text{s.t.} \ F^T F = I, \ F \text{ as defined above.}$$

Details about SC can be found in [7].
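For the two-cluster case used throughout the examples, the relaxed problem can be sketched as below. This is a simplified illustration, not the full procedure of [7]: it assumes NumPy, a connected graph given by a symmetric adjacency matrix, and it replaces the usual k-means step by thresholding the second eigenvector at zero. The helper `cut_value` evaluates the simple Cut of Eq. (1); both function names are ours.

```python
import numpy as np

def spectral_bipartition(A):
    """Relaxed 2-way RatioCut: threshold the second eigenvector of L = D - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A            # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                   # eigenvector of the 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)          # the sign gives the cluster label

def cut_value(A, labels):
    """Simple Cut of Eq. (1): total weight of edges between the two parts."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    return float(A[labels == 0][:, labels == 1].sum())

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
labels = spectral_bipartition(A)
print(labels, cut_value(A, labels))
```

Which triangle receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping and the Cut value are meaningful.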

Gaussian Random Fields (GRF): GRF aims to recover the classification function $F = [F_l \; F_u]^T$ by optimizing the following cost function on the graph:

$$\min_{F \in \mathbb{R}^{n \times c}} \ \mathrm{tr}(F^T L F) \quad \text{s.t.} \ L F_u = 0, \ F_l = Y_l$$

where $c$ is the number of classes, $L$ denotes the unnormalized graph Laplacian and $Y_l$ the labeling matrix. Details about GRF can be found in [3].
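Reading the constraint as harmonicity of $F$ on the unlabeled nodes with $F_l$ clamped to $Y_l$, the minimizer has the well-known closed form $F_u = -L_{uu}^{-1} L_{ul} Y_l$. The sketch below illustrates it under our own assumptions: NumPy, a dense adjacency matrix, and at least one labeled node in every connected component so that $L_{uu}$ is invertible. The function name is hypothetical.

```python
import numpy as np

def grf_harmonic(A, labeled_idx, Y_l):
    """Harmonic solution F_u = -L_uu^{-1} L_ul Y_l with F_l clamped to Y_l."""
    A = np.asarray(A, dtype=float)
    Y_l = np.asarray(Y_l, dtype=float)
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                      # unnormalized Laplacian
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    F = np.zeros((n, Y_l.shape[1]))
    F[labeled_idx] = Y_l                                # clamp labeled rows
    F[unlabeled_idx] = np.linalg.solve(L_uu, -L_ul @ Y_l)
    return F

# Path graph 0-1-2-3 with node 0 labeled class 0 and node 3 labeled class 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
F = grf_harmonic(A, [0, 3], [[1, 0], [0, 1]])
print(F)
print(F.argmax(axis=1))   # predicted class of every node
```

Each unlabeled row of `F` is the average of its neighbors' rows, so labels diffuse along the edges of the graph; this is why sparsifying edges near density valleys directly affects the SSL result.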

Graph Transduction via Alternating Minimization (GTAM): Based on the graph and label information, GTAM aims to recover the classification function $F = [F_l \; F_u]^T \in \mathbb{R}^{n \times c}$, where $F_l$ ($F_u$) corresponds to the labeled (unlabeled) data. Specifically, GTAM solves the following problem:

$$\min \ \mathrm{tr}\left( F^T L F + \mu (F - VY)^T (F - VY) \right) \quad \text{s.t.} \ \sum_{j} Y_{ij} = 1$$

where $V$ is a node regularizer to balance the influence of labels from different classes. Details about GTAM can be found in [S].

2.4 Adaptability and Cross Validation Scheme

Parameters in Eq. (6) can be specified flexibly. Our overall approach is to keep the average degree at $k$, while sparsifying the graph differently to cope with unbalanced data. Note that $R(u)$ is uniformly distributed within $[0, 1]$ and $\phi$ is any monotonic function. For example, if $\phi$ equals the identity and $\lambda = 0.5$, the node degrees are $\deg(u) = k(0.5 + R(u))$ and the degree range is $[\frac{1}{2}k, \frac{3}{2}k]$; another example is $\deg(u) = k(1/3 + 2R^2(u))$, with node degrees in $[\frac{1}{3}k, \frac{7}{3}k]$.

Generally speaking, RMD graphs with small dynamic ranges of node degrees offer less flexibility in dealing with unbalanced data but are robust to outliers (k-NN is the extreme case where all nodes have the same degree). RMD graphs with a large dynamic range lead to sparser graphs and can adapt to unbalanced data. Nevertheless, they are sensitive to outliers and can incur clusters of very small sizes.

Based on the analysis in Sec. 1 and Fig. 1(b), we wish to find minimum cuts with sizable clusters. The most important aspect of our scheme is that it adapts to different levels of unbalancedness. We can then use a cross-validation scheme to select appropriate cuts.

In our simulation, the cross-validation scheme is simple. Among all the partitions obtained by graph-based algorithms on several different RMD graphs, we first discard those results with clusters of sizes smaller than a certain threshold (in which outliers influence the results). Among the rest, we pick the partition with the minimum Cut value.
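The selection rule just described can be sketched as follows. This is an illustration under stated assumptions: NumPy, integer cluster labels starting at 0, and each partition's Cut evaluated on the graph it was obtained from (the text does not specify this last point). The function name and the `min_size` parameter are ours.

```python
import numpy as np

def select_partition(candidates, min_size):
    """candidates: list of (A, labels) pairs, one per RMD configuration.
    Discard partitions with a cluster smaller than min_size, then return the
    index and Cut of the remaining partition with the smallest simple Cut."""
    best, best_cut = None, np.inf
    for idx, (A, labels) in enumerate(candidates):
        labels = np.asarray(labels)
        sizes = np.bincount(labels)
        if len(sizes) < 2 or sizes.min() < min_size:
            continue                                   # too-small cluster: skip
        A = np.asarray(A, dtype=float)
        cross = labels[:, None] != labels[None, :]     # pairs in different clusters
        cut = A[cross].sum() / 2.0                     # each edge is counted twice
        if cut < best_cut:
            best, best_cut = idx, cut
    return best, best_cut

# Path graph 0-1-2-3 with three candidate partitions.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
candidates = [(A, [0, 1, 0, 1]), (A, [0, 1, 1, 1]), (A, [0, 0, 1, 1])]
print(select_partition(candidates, min_size=2))
```

In this toy case the second candidate has a Cut of 1 but is discarded because it isolates a single node; the third is selected.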

3 Analysis 

Assume the data set $D = \{x_1, \dots, x_n\}$ is drawn i.i.d. from density $f$ in $\mathbb{R}^d$. $f$ has a compact support $C$. Let $G = (V, E)$ be the RMD graph. Given a separating hyperplane $S$, denote $C^+, C^-$ as the two subsets of $C$ split by $S$, and $\eta_d$ the volume of the unit ball in $\mathbb{R}^d$. We first show that our ranking acts as an indicator of valley/mode.

Regularity condition: $f(\cdot)$ is continuous and lower-bounded: $f(x) \ge f_{\min} > 0$. It is smooth, i.e. $\|\nabla f(x)\| \le \lambda$, where $\nabla f(x)$ is the gradient of $f(\cdot)$ at $x$. Flat regions are disallowed, i.e. $\forall x \in X$, $\forall \sigma > 0$, $\mathbb{P}\{y : |f(y) - f(x)| < \sigma\} \le M\sigma$, where $M$ is a constant.

First we show the asymptotic consistency of the rank 
R(u). The limit of R(u), p(u), is exactly the comple- 
ment of the volume of the level set containing u. Note 
that p(u) is small near valleys. It is also smoother than 
pdf, and scales in [0,1]. The proof can be found in the 
supplementary material. 

Theorem 1. Assume the density $f$ satisfies the above regularity assumptions. For a proper choice of parameters of $G(u)$, we have,

$$R(u) \to p(u) := \int_{\{x : f(x) \le f(u)\}} f(x)\,dx \qquad (9)$$

Next we study the graph cut induced on the unweighted RMD graph. We show that the limit cut expression on the RMD graph involves a much stronger and adjustable term. This implies that the Cut values near modes can be significantly more expensive relative to those near valleys. For technical simplicity, we assume the RMD graph ideally connects each point $x$ to its $\deg(x)$ closest neighbors.

Theorem 2. Assume the smoothness assumptions in [5] hold for the density $f$, and $S$ is a fixed hyperplane in $\mathbb{R}^d$. For the unweighted RMD graph, let the degree of point $x$ be $\deg(x) = k_n(\lambda + \phi(R(x)))$, where $\lambda$ is the constant bias, and denote the limiting expression $\rho(x) := \lambda + \phi(p(x))$. Assume $k_n/n \to 0$. In case $d = 1$, assume $k_n/\sqrt{n} \to \infty$; in case $d \ge 2$ assume $k_n/\log n \to \infty$. Then as $n \to \infty$ we have that:

$$\sqrt[d]{\frac{n}{k_n}}\,\mathrm{NCut}_n(S) \longrightarrow C_d \int_S f^{1-\frac{1}{d}}(s)\,\rho(s)^{1+\frac{1}{d}}\,ds \left( \mu(C^+)^{-1} + \mu(C^-)^{-1} \right) \qquad (10)$$

is satisfied almost surely. ($\mu(C^\pm) = \int_{C^\pm} f(x)\,dx$.)

Proof. We only present a brief outline of the proof. We want to establish the convergence of the cut term and of the balancing terms respectively, that is:

$$\frac{1}{n k_n} \sqrt[d]{\frac{n}{k_n}}\,\mathrm{cut}_n(S) \longrightarrow C_d \int_S f^{1-\frac{1}{d}}(s)\,\rho(s)^{1+\frac{1}{d}}\,ds, \qquad (11)$$

$$n k_n \frac{1}{\mathrm{vol}(D^\pm)} \longrightarrow \frac{1}{\mu(C^\pm)}, \qquad (12)$$

where $D^+ (D^-) = \{x \in D : x \in C^+ (C^-)\}$ are the discrete versions of $C^+ (C^-)$.

Equation (11) is established in two steps. First we can show that the LHS cut term converges to its expectation $\mathbb{E}\left( \frac{1}{n k_n} \sqrt[d]{\frac{n}{k_n}}\,\mathrm{cut}_n(S) \right)$ by making use of the concentration of measure inequality [17]. Second we show that this expectation term actually converges to the RHS of Equation (11). This is the most intricate part and we state it as a separate result in Lemma 3.

For Equation (12), recall that the volume term of $D^+$ is $\mathrm{vol}(D^+) = \sum_{u \in D^+,\, v \in D} 1$. It can be shown that as $n \to \infty$, the distance between any connected pair $(u, v)$ goes to zero. Next we note that the number of points in $D^+$ is binomially distributed, $\mathrm{Binom}(n, \mu(C^+))$. Using the Chernoff bound for binomial sums we can show that Equation (12) holds almost surely. □

Lemma 3. Given the assumptions of Theorem 2,

$$\mathbb{E}\left( \frac{1}{n k_n} \sqrt[d]{\frac{n}{k_n}}\,\mathrm{cut}_n(S) \right) \longrightarrow C_d \int_S f^{1-\frac{1}{d}}(s)\,\rho(s)^{1+\frac{1}{d}}\,ds,$$

where $C_d = \frac{2\eta_{d-1}}{(d+1)\,\eta_d^{1+1/d}}$.



4 Simulations 

We present experiments to show the power of the RMD graph. We focus on unbalanced settings by sampling the data set in an unbalanced way. First, we use the U-statistic resampling technique to obtain averaged ranks with reduced variance [14].



U-statistic Resampling For Rank Computation:

We input $n = 2m$ data points, the nearest-neighbor parameter $l$ and the number of resampling times $b$. We then randomly split the data set into two equal parts: $S_1 = \{x_1, \dots, x_m\}$, $S_2 = \{x_{m+1}, \dots, x_{2m}\}$. The points in $S_2$ are used to calculate the statistics $G(x_i)$ of $x_i \in S_1$ and vice versa. The ranks of $x_i \in S_1$ within $S_1$ and the ranks of $x_i \in S_2$ within $S_2$ are then computed based on Eq. (5). We then resample and repeat the steps $b$ times and average to obtain averaged ranks for the data points. The algorithm turns out to be robust to the nearest-neighbor parameter $l$. In our simulations we select $l$ to be of the order of $\sqrt{m}$. We set the number of resampling times to $b = 10$ in our experiments. Note that the complexity of rank calculation is $O(b n^2 \log n)$.
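The resampling procedure can be sketched as below. This is an illustrative sketch under our own assumptions: NumPy, an even sample size, Euclidean distance, and, as a stand-in for the statistic of Eq. (7), the plain average of the distances to the $l$ nearest neighbors in the other half. The function names and the `seed` parameter are hypothetical.

```python
import numpy as np

def avg_knn_distance(points, reference, l):
    """Stand-in for G(u): mean distance from each point to its l nearest
    neighbors in `reference` (an assumption, see the lead-in)."""
    d = np.linalg.norm(points[:, None, :] - reference[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :l].mean(axis=1)

def resampled_ranks(X, l, b=10, seed=0):
    """Average the ranks of Eq. (5) over b random half/half splits of X."""
    X = np.asarray(X, dtype=float)
    n = len(X)                       # assumed even, n = 2m
    m = n // 2
    rng = np.random.default_rng(seed)
    ranks = np.zeros(n)
    for _ in range(b):
        perm = rng.permutation(n)
        s1, s2 = perm[:m], perm[m:]
        for own, other in ((s1, s2), (s2, s1)):
            # statistic of the points in `own`, computed from the other half
            g = avg_knn_distance(X[own], X[other], l)
            # rank within `own`: fraction of points whose G is at least as large
            ranks[own] += (g[:, None] <= g[None, :]).mean(axis=1)
    return ranks / b

# One distant outlier (index 0) plus 19 points around the origin.
rng = np.random.default_rng(1)
X = np.vstack([[100.0, 100.0], rng.normal(size=(19, 2))])
ranks = resampled_ranks(X, l=3, b=10)
print(ranks.round(2))
```

The outlier always has the largest statistic in its half, so its averaged rank is the smallest possible value, 1/m, while points in dense regions receive ranks close to 1.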

Other general parameters are specified below:

(1) In the rank $R(u)$, we choose the statistic $G(u)$ as in Eq. (7). The ranks are quite robust to the choice of the parameter $l$; here we fix $l = 50$.

(2) In the step of U-statistic rank calculation, we fix the number of resampling times to $b = 10$.

(3) We adopt three RMD schemes: (a) $\deg(u) = k(1/2 + R(u))$; (b) $\deg(u) = k(1/3 + 2R^2(u))$; (c) $\deg(u) = k(1/4 + 3R^3(u))$. The average degree $k$ (same as in k-NN or b-matching) will be specified later.

(4) For graph construction, both methods described in Sec. 2.2 are applied to build RMD graphs. We find the performances are almost the same. So, for consideration of complexity, we recommend the simple k-NN style algorithm to build RMD graphs.

(5) For cross-validation, we find all clusters returned on the RMD graphs of step (3) are sizable. Among these we pick the partition with the minimum Cut value.

(6) All error rate results are averaged over 20 trials.

4.1 Synthetic Experiments 




(a) SC on various graphs (b) GTAM on various graphs

Figure 4: SC and GTAM on the unbalanced pdf of Fig. 1. The RMD graph outperforms the k-NN or b-matching graph.

2D Examples: We revisit the example discussed in Sec. 1. We apply SC, GRF and GTAM on this unbalanced density; the results are shown in Fig. 4. The sample size is $n = 500$; binary weights are adopted. For GRF and GTAM there are 20 randomly chosen labeled samples, guaranteeing at least one from each cluster, and the GTAM parameter is $\mu = 0.05$. In both experiments the RMD graph significantly outperforms the k-NN and b-matching graphs. Notice that different RMD graphs have different power to cope with unbalanced data.

Multiple Cluster Example: Consider a data set composed of 2 Gaussian and 1 banana-shaped proximal clusters. We manually add an outlier point. SC fails on the ε-graph (similarly on the full-RBF graph) because the outlier point forms a singleton component. SC cuts at the balanced positions on the k-NN (b-matching) graph, whereas it cuts at the valley on the RMD graph.

Divisive Hierarchical Clustering: For situations that require a structural view of the data set, we propose a divisive hierarchical way of performing SC. This is possible because our graph sparsification accounts for unbalanced data, so we can use spectral clustering on the RMD graph for divisive clustering. At every step, the algorithm tries to split each existing part into 2 clusters and computes the corresponding graph cut values. The part with the smallest binary cut value is split, and this is repeated until the expected number of clusters is reached. Fig. 5 shows a synthetic example composed of 4 clusters. SC on the k-NN graph fails at the first cut due to unbalancedness. On the RMD graph, at each step the valley cut is attained for the sub-cluster from the previous step with the smallest RatioCut value.

4.2 Real Data Sets

We focus on 2-cluster unbalanced settings and consider several real data sets from the UCI repository. We construct k-NN, b-matching, full-RBF and RMD graphs, all combined with RBF weights, but do not include the ε-graph because of its overall poor performance. For all the experiments, the sample size is $n = 750$ and the average degree is $k = 30$. The RBF parameter $\sigma$ is set to be the average k-NN distance. The GTAM parameter is $\mu = 0.05$ ([10]). For SSL algorithms 20 randomly labeled samples are chosen with at least one from each class.

Varying Unbalancedness: We start with a comparison for 8vs9 of the 256-dim USPS digit data set. We keep the total sample size at 750, and vary the unbalancedness, i.e. the proportion of the numbers of points from the two clusters, denoted by $n_8, n_9$. Fig. 6 shows that as the unbalancedness increases, the performance severely degrades on traditional graphs. The RMD graph with $\deg(u) = k(1/3 + 2R^2(u))$ has a stronger effect of emphasizing the Cut, and thus adapts to unbalancedness better than the RMD graph with $\deg(u) = k(1/2 + R(u))$.

Other UCI Data Sets: We fix 150/600 samples of two classes, which amounts to an unbalancedness fraction of 1 : 4, for several other UCI data sets. Results for the RMD graph are obtained through the cross-validation scheme (Sec. 2.4). Tab. 1 and 2 show that our method consistently outperforms other graphs, sometimes overwhelmingly when the data set is unbalanced.
(a) ε-graph (b) k-NN graph (c) RMD graph

Figure 3: 2 Gaussian and 1 banana-shaped proximal data set. SC fails on the ε-graph due to the outlier cluster, and cuts at the balanced positions on the k-NN graph while at the valley on the RMD graph.

(a) 1st partition (k-NN) (b) 1st partition (RMD) (c) 2nd partition (RMD) (d) 3rd partition (RMD)

Figure 5: Divisive hierarchical SC is performed on the k-NN and RMD graphs of 4-cluster data. Binary cuts are performed until the number of clusters reaches 4. Notice every step involves unbalanced data. The k-NN graph fails at the first step. On the RMD graph valley cuts are attained at each step.

Figure 6: SC, GRF and GTAM on 8vs9 of the USPS digit data set with various mixture proportions. k-NN or b-matching graphs fail. The RMD graph with $\deg(u) = k(1/3 + 2R^2(u))$ can adapt to unbalanced data better than with $\deg(u) = k(1/2 + R(u))$.

Table 1: SC on various graphs of unbalanced real data sets. Results of RMD are obtained after cross-validation. The RMD graph performs significantly better than other methods.

Table 2: GRF and GTAM on various graphs of unbalanced UCI data sets. Results of RMD are obtained after cross-validation. The RMD graph performs significantly better than other methods.
GRF-kNN 11.53
APPENDIX: SUPPLEMENTARY MATERIAL

For ease of development, let $n = m_1(m_2 + 1)$, and divide the $n$ data points into $D = D_0 \cup D_1 \cup \dots \cup D_{m_1}$, where $D_0 = \{x_1, \dots, x_{m_1}\}$, and each $D_j$, $j = 1, \dots, m_1$, involves $m_2$ points. $D_j$ is used to generate the statistic $G$ for $u$ and $x_j \in D_0$, for $j = 1, \dots, m_1$. $D_0$ is used to compute the rank of $u$:

$$R(u) = \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}_{\{G(x_j; D_j) > G(u; D_j)\}}$$

We provide the proof for the statistic G(u) of the following form: 

i m* 

G( U ;^) = - £ •) (13) 

where $D_{(i)}(u)$ denotes the distance from $u$ to its $i$-th nearest neighbor among the $m_2$ points in $D_j$. Practically we can omit the weight, as in Eq. (7) in the paper. The proof for the first and second statistics can be found in [Zhao & Saligrama, 2008].

Proof of Theorem 1:

Proof. The proof involves two steps:

1. The expectation of the empirical rank $\mathbb{E}[R(u)]$ is shown to converge to $p(u)$ as $n \to \infty$.

2. The empirical rank $R(u)$ is shown to concentrate at its expectation as $n \to \infty$.

The first step is shown through Lemma 3. For the second step, notice that the rank $R(u) = \frac{1}{m_1} \sum_{j=1}^{m_1} Y_j$, where $Y_j = \mathbb{I}_{\{G(x_j; D_j) > G(u; D_j)\}}$ is independent across different $j$'s, and $Y_j \in [0, 1]$. By Hoeffding's inequality, we have:

$$\mathbb{P}\left( |R(u) - \mathbb{E}[R(u)]| > \epsilon \right) < 2 \exp\left( -2 m_1 \epsilon^2 \right)$$

Combining these two steps finishes the proof. □



Proof of Lemma 3: 



Proof. The proof argument is similar to [18] and we provide an outline here. The first trick is to define a cut function for a fixed point $x_i \in V^+$, whose expectation is easier to compute:

$$\mathrm{cut}_{x_i} = \sum_{v \in V^-} w(x_i, v). \qquad (14)$$

Similarly, we can define $\mathrm{cut}_{x_i}$ for $x_i \in V^-$. The expectations of $\mathrm{cut}_{x_i}$ and $\mathrm{cut}_n(S)$ can be related:

$$E(\mathrm{cut}_n(S)) = n E_x(E(\mathrm{cut}_x)) \qquad (15)$$

Then the value of $E(\mathrm{cut}_{x_i})$ can be computed as

$$E(\mathrm{cut}_{x_i}) = (n-1) \int_0^\infty \left[ \int_{B(x_i, r) \cap C^-} f(y)\,dy \right] dF_{R^k_{x_i}}(r),$$

where $r$ is the distance of $x_i$ to its $k_n \rho(x_i)$-th nearest neighbor. The value of $r$ is a random variable and can be characterized by the CDF $F_{R^k_{x_i}}(r)$. Combining with equation (15), we can write down the whole expected cut value:

$$E(\mathrm{cut}_n(S)) = n E_x(E(\mathrm{cut}_x)) = n \int_{\mathbb{R}^d} f(x) E(\mathrm{cut}_x)\,dx = n(n-1) \int_{\mathbb{R}^d} f(x) \left[ \int_0^\infty g(x, r)\,dF_{R^k_x}(r) \right] dx.$$
To simplify the expression, we use $g(x, r)$ to denote

$$g(x, r) = \begin{cases} \int_{B(x,r) \cap C^-} f(y)\,dy, & x \in C^+ \\ \int_{B(x,r) \cap C^+} f(y)\,dy, & x \in C^- \end{cases}$$

Under general assumptions, when $n$ tends to infinity, the random variable $r$ will highly concentrate around its mean $E(r^k_x)$. Furthermore, as $k_n/n \to 0$, $E(r^k_x)$ tends to zero, and the speed of convergence is

$$E(r^k_x) \approx \left( \frac{k \rho(x)}{(n-1) f(x) \eta_d} \right)^{1/d}.$$

So the inner integral in the cut value can be approximated by $g(x, E(r^k_x))$, which implies

$$E(\mathrm{cut}_n(S)) \approx n(n-1) \int_{\mathbb{R}^d} f(x)\, g(x, E(r^k_x))\,dx. \qquad (16)$$


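The next trick is to decompose the integral over $\mathbb{R}^d$ into two orthogonal directions, i.e., the direction along the hyperplane $S$ and its normal direction (we use $\vec{n}$ to denote the unit normal vector):

$$\int_{\mathbb{R}^d} f(x)\, g(x, E(r^k_x))\,dx = \int_S \int_{-\infty}^{+\infty} f(s + t\vec{n})\, g(s + t\vec{n}, E(r^k_{s + t\vec{n}}))\,dt\,ds.$$

When $t > E(r^k_{s + t\vec{n}})$, the integral region of $g$ will be empty: $B(x, E(r^k_x)) \cap C^- = \emptyset$. On the other hand, when $x = s + t\vec{n}$ is close to $s \in S$, we have the approximation $f(x) \approx f(s)$:

$$\begin{aligned}
\int_{-\infty}^{+\infty} f(s + t\vec{n})\, g(s + t\vec{n}, E(r^k_{s + t\vec{n}}))\,dt
&\approx 2 \int_0^{E(r^k_s)} f(s) \left[ f(s)\, \mathrm{vol}\left( B(s + t\vec{n}, E(r^k_s)) \cap C^- \right) \right] dt \\
&= 2 f^2(s) \int_0^{E(r^k_s)} \mathrm{vol}\left( B(s + t\vec{n}, E(r^k_s)) \cap C^- \right) dt.
\end{aligned}$$

The term $\mathrm{vol}\left( B(s + t\vec{n}, E(r^k_s)) \cap C^- \right)$ is the volume of a $d$-dimensional spherical cap of radius $E(r^k_s)$, which is at distance $t$ from the center. Through direct computation we obtain:

$$\int_0^{E(r^k_s)} \mathrm{vol}\left( B(s + t\vec{n}, E(r^k_s)) \cap C^- \right) dt = E(r^k_s)^{d+1}\, \frac{\eta_{d-1}}{d+1},$$

where $\eta_{d-1}$ denotes the volume of the unit ball in $\mathbb{R}^{d-1}$. Combining the above step and plugging the approximation of $E(r^k_s)$ into Equation (16), we finish the proof. □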
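The "direct computation" of the integrated spherical cap volume can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a plain midpoint rule, the helper names are hypothetical, and the closed form it compares against is $\eta_{d-1} r^{d+1}/(d+1)$ with $\eta_{d-1}$ the volume of the unit ball in $\mathbb{R}^{d-1}$.

```python
import math


def unit_ball_volume(d):
    """Volume of the unit ball in R^d (equals 1 for d = 0)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def cap_volume(d, r, t, steps=400):
    """Volume of the part of a d-ball of radius r beyond a hyperplane at distance t."""
    if t >= r:
        return 0.0
    eta = unit_ball_volume(d - 1)
    h = (r - t) / steps
    total = 0.0
    for i in range(steps):
        s = t + (i + 0.5) * h                      # midpoint of the i-th slab
        total += eta * (r * r - s * s) ** ((d - 1) / 2)
    return total * h


def integrated_cap_volume(d, r, steps=400):
    """Midpoint-rule value of the integral of cap_volume over t in [0, r]."""
    h = r / steps
    return sum(cap_volume(d, r, (i + 0.5) * h, steps) for i in range(steps)) * h


def closed_form(d, r):
    """Closed form eta_{d-1} * r^(d+1) / (d+1) of the same integral."""
    return unit_ball_volume(d - 1) * r ** (d + 1) / (d + 1)
```

The identity follows by swapping the order of integration: $\int_0^r \int_t^r \eta_{d-1}(r^2 - s^2)^{(d-1)/2}\,ds\,dt = \eta_{d-1}\int_0^r s\,(r^2 - s^2)^{(d-1)/2}\,ds$.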
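Lemma 4. By choosing $l$ properly, as $m_2 \to \infty$, it follows that

$$|E[R(u)] - p(u)| \to 0.$$

Proof. Take expectation with respect to $D$:

$$\begin{aligned}
E_D[R(u)] &= E_{D \setminus D_0}\left[ E_{D_0}\left[ \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}\{G(u; D_j) < G(x_j; D_j)\} \right] \right] && (17) \\
&= \frac{1}{m_1} \sum_{j=1}^{m_1} E_{x_j}\left[ E_{D_j}\left[ \mathbb{I}\{G(u; D_j) < G(x_j; D_j)\} \right] \right] && (18) \\
&= E_x\left[ \mathcal{P}_{D_1}\left( G(u; D_1) < G(x; D_1) \right) \right] && (19)
\end{aligned}$$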
The last equality holds due to the i.i.d. symmetry of $\{x_1, \ldots, x_{m_1}\}$ and $D_1, \ldots, D_{m_1}$. We fix both $u$ and $x$ and temporarily discard $E_x$. Let $F_x(y_1, \ldots, y_{m_2}) = G(x) - G(u)$, where $y_1, \ldots, y_{m_2}$ are the $m_2$ points in $D_1$. It follows:

$$\mathcal{P}_{D_1}(G(u) < G(x)) = \mathcal{P}_{D_1}(F_x(y_1, \ldots, y_{m_2}) > 0) = \mathcal{P}_{D_1}(F_x - EF_x > -EF_x).$$

To check McDiarmid's requirements, we replace $y_j$ with $y_j'$. It is easily verified that for all $j = 1, \ldots, m_2$,

$$|F_x(y_1, \ldots, y_{m_2}) - F_x(y_1, \ldots, y_j', \ldots, y_{m_2})| \le 2^{1/d}\, \frac{2C}{l} \le \frac{4C}{l} \qquad (20)$$

where $C$ is the diameter of the support. Notice that although $y_1, \ldots, y_{m_2}$ are random vectors, we can still apply McDiarmid's inequality, because according to the form of $G$, $F_x(y_1, \ldots, y_{m_2})$ is a function of $m_2$ i.i.d. random variables $r_1, \ldots, r_{m_2}$, where $r_i$ is the distance from $x$ to $y_i$. Therefore if $EF_x < 0$, or $EG(x) < EG(u)$, we have by McDiarmid's inequality,

$$\mathcal{P}_{D_1}(G(u) < G(x)) = \mathcal{P}_{D_1}(F_x > 0) = \mathcal{P}_{D_1}(F_x - EF_x > -EF_x) \le \exp\left( -\frac{(EF_x)^2 l^2}{8 C^2 m_2} \right).$$

Rewrite the above inequality as:

$$\mathbb{I}_{\{EF_x > 0\}} - e^{-\frac{(EF_x)^2 l^2}{8 C^2 m_2}} \le \mathcal{P}_{D_1}(F_x > 0) \le \mathbb{I}_{\{EF_x > 0\}} + e^{-\frac{(EF_x)^2 l^2}{8 C^2 m_2}} \qquad (21)$$

It can be shown that the same inequality holds for $EF_x > 0$, or $EG(x) > EG(u)$. Now we take expectation with respect to $x$:

$$\mathcal{P}_x(EF_x > 0) - E_x\left[ e^{-\frac{(EF_x)^2 l^2}{8 C^2 m_2}} \right] \le E\left[ \mathcal{P}_{D_1}(F_x > 0) \right] \le \mathcal{P}_x(EF_x > 0) + E_x\left[ e^{-\frac{(EF_x)^2 l^2}{8 C^2 m_2}} \right] \qquad (22)$$


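Divide the support of $x$ into two parts, $X_1$ and $X_2$, where $X_1$ contains those $x$ whose density $f(x)$ is relatively far away from $f(u)$, and $X_2$ contains those $x$ whose density is close to $f(u)$. We show that for $x \in X_1$ the above exponential term converges to 0 and $\mathcal{P}(EF_x > 0) = \mathcal{P}_x(f(u) > f(x))$, while the rest $x \in X_2$ has very small measure. Let $A(x) = \left( \frac{l}{f(x) c_d m_2} \right)^{1/d}$. By Lemma 5 we have:

$$|EG(x) - A(x)| \le \gamma \left( \frac{l}{m_2} \right)^{1/d} A(x) \le \gamma \left( \frac{l}{m_2} \right)^{1/d} \left( \frac{l}{f_{\min} c_d m_2} \right)^{1/d} = \frac{\gamma_1}{c_d^{1/d}} \left( \frac{l}{m_2} \right)^{2/d}$$

where $\gamma$ denotes the big $O(\cdot)$, and $\gamma_1 = \gamma \left( \frac{1}{f_{\min}} \right)^{1/d}$. Applying the uniform bound we have:

$$A(x) - A(u) - 2 \frac{\gamma_1}{c_d^{1/d}} \left( \frac{l}{m_2} \right)^{2/d} \le E[G(x) - G(u)] \le A(x) - A(u) + 2 \frac{\gamma_1}{c_d^{1/d}} \left( \frac{l}{m_2} \right)^{2/d}$$

Now let $X_1 = \left\{ x : |f(x) - f(u)| \ge 3 \gamma_1 d\, f_{\min}^{\frac{d+1}{d}} \left( \frac{l}{m_2} \right)^{1/d} \right\}$. For $x \in X_1$, it can be verified that $|A(x) - A(u)| \ge 3 \frac{\gamma_1}{c_d^{1/d}} \left( \frac{l}{m_2} \right)^{2/d}$, or $|E[G(x) - G(u)]| > \frac{\gamma_1}{c_d^{1/d}} \left( \frac{l}{m_2} \right)^{2/d}$, and $\mathbb{I}_{\{f(u) > f(x)\}} = \mathbb{I}_{\{EG(x) > EG(u)\}}$. For the exponential term in Equ. (21) we have:

$$\exp\left( -\frac{(EF_x)^2 l^2}{8 C^2 m_2} \right) \le \exp\left( -\frac{\gamma_1^2\, l^{2 + \frac{4}{d}}}{8 C^2 c_d^{2/d}\, m_2^{1 + \frac{4}{d}}} \right) \qquad (23)$$

For $x \in X_2 = \left\{ x : |f(x) - f(u)| < 3 \gamma_1 d \left( \frac{l}{m_2} \right)^{1/d} f_{\min}^{\frac{d+1}{d}} \right\}$, by the regularity assumption, we have $\mathcal{P}(X_2) < 3 M \gamma_1 d \left( \frac{l}{m_2} \right)^{1/d} f_{\min}^{\frac{d+1}{d}}$. Combining the two cases into Equ. (22), we have for the upper bound:

$$\begin{aligned}
E_D[R(u)] &= E_x\left[ \mathcal{P}_{D_1}(G(u) < G(x)) \right] \\
&= \int_{X_1} \mathcal{P}_{D_1}(G(u) < G(x)) f(x)\,dx + \int_{X_2} \mathcal{P}_{D_1}(G(u) < G(x)) f(x)\,dx \\
&\le \left( \mathcal{P}_x(f(u) > f(x)) + \exp\left( -\frac{\gamma_1^2\, l^{2 + \frac{4}{d}}}{8 C^2 c_d^{2/d}\, m_2^{1 + \frac{4}{d}}} \right) \right) \mathcal{P}(x \in X_1) + \mathcal{P}(x \in X_2) \\
&\le \mathcal{P}_x(f(u) > f(x)) + \exp\left( -\frac{\gamma_1^2\, l^{2 + \frac{4}{d}}}{8 C^2 c_d^{2/d}\, m_2^{1 + \frac{4}{d}}} \right) + 3 M \gamma_1 d\, f_{\min}^{\frac{d+1}{d}} \left( \frac{l}{m_2} \right)^{1/d}
\end{aligned}$$

Let $l = m_2^{\alpha}$ with $\frac{d+4}{2d+4} < \alpha < 1$; then the latter two terms converge to 0 as $m_2 \to \infty$. Similar lines hold for the lower bound. The proof is finished. □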

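Lemma 5. Let $A(x) = \left( \frac{l}{m c_d f(x)} \right)^{1/d}$ and $\lambda_1 = \frac{\lambda}{f_{\min}} \left( \frac{1.5}{c_d f_{\min}} \right)^{1/d}$. By choosing $l$ appropriately, the expectation of the $l$-NN distance $E D_{(l)}(x)$ among $m$ points satisfies:

$$|E D_{(l)}(x) - A(x)| = O\left( A(x) \lambda_1 \left( \frac{l}{m} \right)^{1/d} \right)$$

Proof. Denote $r(x, \alpha) = \min\{r : \mathcal{P}(B(x, r)) \ge \alpha\}$. Let $\delta_m \to 0$ as $m \to \infty$, with $0 < \delta_m < 1/2$. Let $U \sim \mathrm{Bin}(m, (1 + \delta_m)\frac{l}{m})$ be a binomial random variable, with $EU = (1 + \delta_m) l$. We have:

$$\mathcal{P}\left( D_{(l)}(x) > r\left(x, (1 + \delta_m)\tfrac{l}{m}\right) \right) = \mathcal{P}(U < l) \le \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right)$$

The last inequality holds from Chernoff's bound. Abbreviate $r_1 = r(x, (1 + \delta_m)\frac{l}{m})$; then $E D_{(l)}(x)$ can be bounded as:

$$\begin{aligned}
E D_{(l)}(x) &\le r_1 \left[ 1 - \mathcal{P}(D_{(l)}(x) > r_1) \right] + C\, \mathcal{P}(D_{(l)}(x) > r_1) \\
&\le r_1 + C \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right)
\end{aligned}$$
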

where C is the diameter of support. Similarly we can show the lower bound: 



ED (l) (x) > r(x, (1 - 5 m )L) - Cexp (-^^)) 



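Consider the upper bound. We relate $r_1$ with $A(x)$. Notice that $\mathcal{P}(B(x, r_1)) = (1 + \delta_m)\frac{l}{m} \ge c_d r_1^d f_{\min}$, so a fixed but loose upper bound is $r_1 \le \left( \frac{(1 + \delta_m) l}{c_d f_{\min} m} \right)^{1/d} = r_{\max}$. Assume $l/m$ is sufficiently small so that $r_1$ is sufficiently small. By the smoothness condition, the density within $B(x, r_1)$ is lower-bounded by $f(x) - \lambda r_1$, so we have:

$$\begin{aligned}
\mathcal{P}(B(x, r_1)) = (1 + \delta_m)\frac{l}{m} &\ge c_d r_1^d \left( f(x) - \lambda r_1 \right) \\
&= c_d r_1^d f(x) \left( 1 - \frac{\lambda}{f(x)} r_1 \right) \\
&\ge c_d r_1^d f(x) \left( 1 - \frac{\lambda}{f_{\min}} r_{\max} \right)
\end{aligned}$$

That is:

$$r_1 \le A(x) \left( \frac{1 + \delta_m}{1 - \frac{\lambda}{f_{\min}} r_{\max}} \right)^{1/d}$$

Insert the expression of $r_{\max}$ and set $\lambda_1 = \frac{\lambda}{f_{\min}} \left( \frac{1.5}{c_d f_{\min}} \right)^{1/d}$; we have:

$$\begin{aligned}
E D_{(l)}(x) - A(x) &\le A(x) \left[ \left( \frac{1 + \delta_m}{1 - \lambda_1 \left( \frac{l}{m} \right)^{1/d}} \right)^{1/d} - 1 \right] + C \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right) \\
&\le A(x) \left[ \frac{1 + \delta_m}{1 - \lambda_1 \left( \frac{l}{m} \right)^{1/d}} - 1 \right] + C \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right) \\
&= A(x)\, \frac{\delta_m + \lambda_1 \left( \frac{l}{m} \right)^{1/d}}{1 - \lambda_1 \left( \frac{l}{m} \right)^{1/d}} + C \exp\left( -\frac{\delta_m^2 l}{2(1 + \delta_m)} \right) \\
&= O\left( A(x) \lambda_1 \left( \frac{l}{m} \right)^{1/d} \right)
\end{aligned}$$

The last equality holds if we choose $l = m^{\frac{3d+8}{4d+8}}$ and $\delta_m = m^{-\frac{1}{4}}$. Similar lines follow for the lower bound. Combine these two parts and the proof is finished.

□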
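The leading-order claim of Lemma 5, $E D_{(l)}(x) \approx A(x) = (l/(m c_d f(x)))^{1/d}$, can be illustrated with a small Monte Carlo experiment. The sketch below is an assumption-laden illustration rather than part of the proof: it uses a uniform density on the unit cube (so $f \equiv 1$ and the smoothness constant is zero), evaluates the distance at an interior point, and the function names are hypothetical.

```python
import math
import random


def unit_ball_volume(d):
    """c_d: volume of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def predicted_lnn_distance(l, m, density, d):
    """A(x) = (l / (m * c_d * f(x)))^(1/d)."""
    return (l / (m * unit_ball_volume(d) * density)) ** (1.0 / d)


def simulated_lnn_distance(x, l, m, d, trials, rng):
    """Monte Carlo estimate of E D_(l)(x) for m uniform points in [0, 1]^d."""
    total = 0.0
    for _ in range(trials):
        dists = sorted(
            math.dist(x, tuple(rng.random() for _ in range(d)))
            for _ in range(m)
        )
        total += dists[l - 1]          # distance to the l-th nearest neighbor
    return total / trials


# Usage: compare simulation and prediction at the center of the unit square.
rng = random.Random(1)
simulated = simulated_lnn_distance((0.5, 0.5), l=20, m=1000, d=2, trials=200, rng=rng)
predicted = predicted_lnn_distance(l=20, m=1000, density=1.0, d=2)
```

With these settings the neighborhood radius is about 0.08, so the ball stays well inside the unit square and boundary effects do not enter the comparison.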



